Abstract

The majority of the human genome consists of repeated sequences. An important type of repeated sequence
common in the human genome is the tandem repeat, where identical copies appear next to each other. For example, in
the sequence AGTCTGTGC, TGTG is a tandem repeat, which may be generated from AGTCTGC by a tandem
duplication of length 2. In this work, we investigate the possibility of generating a large number of sequences from
a seed, i.e. a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, 
a notion that quantifies the system’s generating power. Our results include exact capacity values for certain tandem 
duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA 
and the genetic code, we define the notion of the expressiveness of a tandem duplication system as the capability of 
expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for 
general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, 
presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded 
tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e. 
they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular
interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have
full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in 
generating a large number of sequences for these systems.

Index Terms

Capacity, expressiveness, tandem repeats, tandem duplication, finite automaton, irreducible string

I. Introduction 

More than 50% of the human genome consists of repeated sequences. Two important types of common repeats
are i) interspersed repeats and ii) tandem repeats. Interspersed repeats are caused by transposons. A transposon
(jumping gene) is a segment of DNA that can copy or cut and paste itself into new positions of the genome.
Tandem repeats are caused by slipped-strand mispairings. Slipped-strand mispairings occur when one DNA
strand in the duplex becomes misaligned with the other.

Tandem repeats are common in both prokaryote and eukaryote genomes. They are present in both coding
and non-coding regions and are believed to be the cause of several genetic disorders. The effects of tandem
repeats on several biological processes are understood through these disorders. They can result in the generation of toxic or
malfunctioning proteins, chromosome fragility, expansion diseases, silencing of genes, modulation of transcription
and translation, and rapid morphological changes.

A process that leads to tandem repeats, e.g. through slipped-strand mispairing, is called tandem duplication, which
allows substrings to be duplicated next to their original position. For example, from the sequence AGTCGTCGCT,
a tandem duplication of length 2 can give AGTCGTCGCGCT, which, if followed by a duplication of length
3, may give AGTCGTCGTCGCGCT. The prevalence of tandem repeats and the fact that much of our unique
DNA likely originated as repeated sequences motivate us to study the capacity and expressiveness of string
systems with tandem duplication, as defined below.

The model of a string duplication system consists of a seed, i.e. a starting string of finite length, a set of duplication
rules that allow generating new strings from existing ones, and the set of all sequences that can be obtained by
applying the duplication rules to the seed a finite number of times. The notion of capacity, introduced in earlier work and
defined more formally in the sequel, represents the average number of m-ary symbols per sequence symbol that
are asymptotically required to encode a sequence in the string system, where m is the alphabet size (for DNA
sequences the alphabet size is 4). The maximum value for capacity is 1. A duplication system is fully expressive
if all strings over the alphabet appear as a substring of some string in the system. As we will show, if a system is
not fully expressive, then its capacity is strictly less than 1.

Before presenting the notation, definitions, and the results more formally, in the rest of this section, we present 
two simple examples to illustrate the notions of expressiveness and capacity for tandem duplication string systems. 
Furthermore, we also outline some useful tools as well as some of the results of the paper. 

This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), 2015.

Fig. 1. Finite automaton for the systems $S = (\{0,1\}, 01, \mathcal{T}^{tan}_{\le k})$, where $k \ge 2$, including the system of Example 1. Notation used here is
described in detail in Section II.

Example 1. Consider a string system on the binary alphabet $\Sigma = \{0,1\}$ with 01 as the seed that allows tandem
duplications of length up to 2. It is easy to check that all strings generated by this system start with 0 and
end with 1. In fact, it can be proved that all binary strings of length n which start with 0 and end with 1 can be
generated by this system. The proof is based on the fact that every such string can be written as $0^{r_1} 1^{r_2} \cdots 0^{r_{v-1}} 1^{r_v}$,
where each $r_i \ge 1$ and v is even. A natural way to generate this string is to duplicate 01 until $(01)^{v/2}$ is obtained and then duplicate
the 0s and 1s as needed via duplications of length 1.

Expressiveness: From the preceding paragraph, every binary sequence s can be generated as a substring in this
system as 0s1. For example, although 11010 cannot be generated by this system, it can be generated as a substring
of 0110101 in the following way:

01 → 0101 → 010101 → 0110101.

Hence this system is fully expressive. 

Capacity: The number of length-n strings in this system is $2^{n-2}$. Thus, encoding sequences of length n in this
system requires n − 2 bits. The capacity, or equivalently the average number of bits (since the alphabet $\Sigma$ is of size
2) per symbol, is thus equal to 1. This is not surprising as the system generates almost all binary sequences. □
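The count $2^{n-2}$ in Example 1 can be checked by brute force for small n. The following Python sketch is only an illustration: the helper names `tandem_children` and `generate` are ours, and the search is truncated at length 10, which is safe because duplications never shorten a string.

```python
def tandem_children(x, k_max):
    """All strings obtained from x by one tandem duplication of length <= k_max."""
    out = set()
    for k in range(1, k_max + 1):
        for i in range(len(x) - k + 1):
            # x = u v w with u = x[:i], v = x[i:i+k]; the result is u v v w
            out.add(x[:i + k] + x[i:i + k] + x[i + k:])
    return out

def generate(seed, k_max, n_max):
    """All strings of length <= n_max reachable from seed."""
    seen, frontier = {seed}, [seed]
    while frontier:
        x = frontier.pop()
        for y in tandem_children(x, k_max):
            if len(y) <= n_max and y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

strings = generate("01", 2, 10)
for n in range(2, 11):
    count = sum(1 for x in strings if len(x) == n)
    print(n, count, 2 ** (n - 2))  # the last two columns should agree
```

For each n, the number of generated strings is compared with the number of binary strings of length n that start with 0 and end with 1.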

Observing these facts for an alphabet of size 2, one can ask related questions on expressiveness and capacity for 
higher alphabet sizes and duplication lengths. However, counting the number of length-n sequences for capacity 
calculation and characterizing fully expressive systems for larger alphabets are often not straightforward tasks. In 
this paper, we study these questions and develop methods to answer them. 

A useful tool in this study is the theory of finite automata. As a simple example, note that the string system over
the binary alphabet in the preceding example can be represented by the finite automaton given in Figure 1. The regular
expression for the language defined by the finite automaton is

$R_{01} = (0^+ 1^+)^+$, (1)

which represents all binary strings that start with 0 and end with 1. Here, for a sequence s, $s^+$ denotes one or more
concatenated copies of s.
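That the regular expression (1) describes exactly the binary strings starting with 0 and ending with 1 can be confirmed exhaustively for short strings. This is a small sketch using Python's standard `re` module; the length bound 12 is an arbitrary choice of ours.

```python
import re
from itertools import product

R01 = re.compile(r"(0+1+)+")

def starts0_ends1(x):
    return x[0] == "0" and x[-1] == "1"

# Collect every binary string (length 1..12) on which the two descriptions disagree.
mismatches = []
for n in range(1, 13):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        if bool(R01.fullmatch(x)) != starts0_ends1(x):
            mismatches.append(x)

print(mismatches)  # an empty list means the two descriptions agree
```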
One can use the Perron-Frobenius theory to count the number of sequences which can be generated by a
finite automaton. This enables us to use finite automata as a tool to calculate capacity for some string duplication
systems with tandem repeats over larger alphabets.

In our results, we find that the exact capacity of the tandem duplication string system over the ternary alphabet
with seed 012 and duplication length at most 3 equals $\log_3 \frac{3+\sqrt{5}}{2} \simeq 0.876036$. Moreover, we generalize this result
by characterizing the capacity of tandem duplication string systems over an arbitrary alphabet and a seed with
maximum duplication length of 3. Namely, we show that if the maximum duplication length is 3 and the seed
contains abc as a substring, where a, b, and c are distinct symbols, then the capacity is $\simeq 0.876036 \log_{|\Sigma|} 3$. If such
a substring does not exist in the seed, then the capacity is given by $\log_{|\Sigma|} 2$, unless the seed is of the form $a^m$, in
which case the capacity is 0. Some of these results are highlighted in Table I.

Our next example presents a system that, unlike that of Example 1, is not fully expressive.

Example 2. Consider a tandem duplication string system over the ternary alphabet {0,1,2} with seed 012 and 
maximum duplication length 3. This system is not fully expressive as it cannot generate 210, 102, or 021, even 
as a substring. It is not difficult to see that to generate any of these strings, at least one of the other two must be 
already present as a substring of the seed. Since 012 does not contain any, by induction, it follows that the system 
is not fully expressive. □ 
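The claim of Example 2 can be spot-checked by enumerating the system up to a small length and searching for the three excluded substrings. The sketch below assumes a brute-force search truncated at length 9; `reachable` is an illustrative helper name of ours, not notation from the paper.

```python
def reachable(seed, k_max, n_max):
    """Strings of length <= n_max obtained from seed by tandem duplications of length <= k_max."""
    seen, frontier = {seed}, [seed]
    while frontier:
        x = frontier.pop()
        for k in range(1, k_max + 1):
            for i in range(len(x) - k + 1):
                y = x[:i + k] + x[i:i + k] + x[i + k:]  # duplicate x[i:i+k] in place
                if len(y) <= n_max and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen

strings = reachable("012", 3, 9)
forbidden = ("210", "102", "021")
violations = [x for x in strings if any(f in x for f in forbidden)]
print(len(strings), violations)  # no violations are expected
```

A finite search of this kind does not replace the induction argument; it only confirms it on the strings of length at most 9.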

Based on the previous example, one may ask what happens if we start with a seed that contains one of the strings
210, 102, or 021, e.g. if we let the seed be 01210. Does the system become fully expressive? While this system
can generate all strings of length 3 as substrings, the answer is still no, as shown in Theorem 2: regardless of the
seed, a ternary system with maximum duplication length of 3 is not fully expressive. We show in Theorem 4 that
a maximum duplication length of at least 4 is needed to arrive at a fully expressive ternary system.

While for alphabets of size 2 or 3, increasing the maximum length of duplications turns a system that is not fully
expressive into one that is, for alphabets of size 4 or more, these systems are not fully expressive regardless of how large
the bound on duplication length is. The main tool in constructing quaternary strings that do not appear independently
or as substrings in these systems is Thue’s result proving the existence of ternary square-free sequences of any
length. Note that unary and binary square-free sequences of arbitrarily large length do not exist. The existence
of such sequences underlies the significant shift in the behavior of tandem duplication systems with regard to
expressiveness as a function of alphabet size. Some of our results on expressiveness are summarized in Table II.

As part of this paper, we also study regular languages for tandem duplication string systems. In earlier work, it was shown
that the tandem duplication string system is not regular if the maximum duplication length is 4 or more when
the seed contains 3 consecutive distinct symbols as a substring. However, for maximum duplication length 3, this
question remained open. In this paper, we show in Section IV that if the maximum duplication length is 3, a tandem
duplication string system is regular irrespective of the seed and the alphabet size. Moreover, we characterize the
exact capacity for all these systems.

$\Sigma$      s        k    Capacity
{0,1,2}       012      3    $\simeq 0.876036$
arbitrary     xabcy    3    $\simeq 0.876036 \log_{|\Sigma|} 3$

TABLE I

Capacity values of tandem duplication string systems $(\Sigma, s, \mathcal{T}^{tan}_{\le k})$. Here $x, y \in \Sigma^*$, and $a, b, c \in \Sigma$ are distinct.


$\Sigma$            s            k            fully expressive
{0,1,2}             arbitrary    $\le 3$      No
{0,1,2}             012          $\ge 4$      Yes
$|\Sigma| \ge 4$    arbitrary    arbitrary    No

TABLE II

Expressiveness of tandem duplication string systems $(\Sigma, s, \mathcal{T}^{tan}_{\le k})$.


Related Work: Tandem duplications have already been studied in the literature. However, the main concern of
these works is to determine the place of tandem duplication rules in the Chomsky hierarchy of formal languages.
Work more closely related to ours defines and studies string systems with different duplication rules, namely end
duplication, tandem duplication, reversed duplication, and duplication with a gap.
In end duplication, a substring of certain length k is appended to the end of the previous string, for example
ACTGT → ACTGTCT. In reversed tandem duplication, the reverse of a substring is appended in tandem in the
previous string, for example ACTGT → ACTTCGT. In duplication with a gap, a substring is inserted after a
certain gap g from its position in the previous string, for example ACTGT → ACTGCTT.

For tandem duplication string systems, earlier work shows that for a fixed duplication length the capacity is
0. Further, it gives a lower bound on the capacity of these systems when duplications of all lengths are allowed. In
this paper, we consider tandem duplication string systems where we restrict the maximum size of the block being
tandemly duplicated to a certain length. For these bounded tandem duplication
string systems, it is known that if the maximum duplication length is 4 or more and the alphabet size is more than 2, the system
is not regular for any seed that contains 3 consecutive distinct symbols as a substring. However, for maximum
duplication length 3, this question was left open. In this paper, we show in Section IV that the language is regular
for maximum duplication length 3 irrespective of the seed and the alphabet size. We also characterize the exact
capacity of these systems.

In the rest of the paper, the term tandem duplication string system refers to this kind of string duplication
system with bounded duplication length.

The rest of the paper is organized as follows. In Section II, we present the preliminary definitions and notation. In
Section III, we derive our main results on capacity and expressiveness. In Section IV, we show that if the maximum
duplication length is 3, then the tandem duplication string system is regular irrespective of the seed and alphabet
size. Further, using the regularity of the systems, we extend our capacity results. We present our concluding remarks
in Section V.


II. Preliminaries 


Let $\Sigma$ be some finite alphabet. An n-string $x = x_1 x_2 \cdots x_n \in \Sigma^n$ is a finite sequence where $x_i \in \Sigma$ and $|x| = n$.
The set of all finite strings over the alphabet $\Sigma$ is denoted by $\Sigma^*$. For two strings $x \in \Sigma^n$ and $y \in \Sigma^m$, their
concatenation is denoted by $xy \in \Sigma^{n+m}$. For a positive integer m and a string s, $s^m$ denotes the concatenation
of m copies of s. A string $v \in \Sigma^*$ is a substring of x if x = uvw, where $u, w \in \Sigma^*$.

A string system $S \subseteq \Sigma^*$ is represented as a tuple $S = (\Sigma, s, \mathcal{T})$, where $s \in \Sigma^*$ is a finite length string called
seed, which is used to initiate the duplication process, and $\mathcal{T}$ is a set of rules that allow generating new strings
from existing ones. In other words, the string system $S = (\Sigma, s, \mathcal{T})$ contains all strings that can be generated
from s using rules from $\mathcal{T}$ a finite number of times.

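A tandem duplication map $T_{i,k} : \Sigma^* \to \Sigma^*$,

$$T_{i,k}(x) = \begin{cases} uvvw, & x = uvw,\ |u| = i,\ |v| = k, \\ x, & \text{else,} \end{cases}$$

creates and inserts a copy of the substring of length k which starts at position i + 1. We use $\mathcal{T}^{tan}_{k}$ and
$\mathcal{T}^{tan}_{\le k}$ to denote the set of tandem duplications of length k, and tandem duplications of length at most k, respectively,

$$\mathcal{T}^{tan}_{k} = \{T_{i,k} : i \in \mathbb{N}\},$$

$$\mathcal{T}^{tan}_{\le k} = \{T_{i,j} : i, j \in \mathbb{N},\ j \le k\}.$$

With this notation, the system of Example 1 can be written as $(\{0,1\}, 01, \mathcal{T}^{tan}_{\le 2})$.
The capacity of the string system $S = (\Sigma, s, \mathcal{T})$ is defined as

$$\mathrm{cap}(S) = \limsup_{n \to \infty} \frac{\log_{|\Sigma|} |S \cap \Sigma^n|}{n}. \quad (2)$$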


Furthermore, it is fully expressive if for each $y \in \Sigma^*$, there exists a $z \in S$ such that y is a substring of z.
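The definitions of this section translate almost literally into code. The sketch below is one possible rendering under our own naming (`T`, `system_up_to`, `capacity_ratio` are illustrative, not from the paper). It evaluates the ratio inside the limsup of (2) only at finite n, so it gives a rough indication of the capacity rather than its value.

```python
import math

def T(x, i, k):
    """T_{i,k}: if x = uvw with |u| = i and |v| = k, return uvvw; otherwise return x."""
    if i >= 0 and k >= 1 and i + k <= len(x):
        return x[:i] + 2 * x[i:i + k] + x[i + k:]
    return x

def system_up_to(seed, k_max, n_max):
    """Strings of length <= n_max generated from seed by duplications of length <= k_max."""
    seen, frontier = {seed}, [seed]
    while frontier:
        x = frontier.pop()
        for k in range(1, k_max + 1):
            for i in range(len(x) - k + 1):
                y = T(x, i, k)
                if len(y) <= n_max and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return seen

def capacity_ratio(strings, alphabet_size, n):
    """Finite-n value of log_{|Sigma|} |S ∩ Sigma^n| / n, the quantity inside the limsup."""
    count = sum(1 for x in strings if len(x) == n)
    return math.log(count, alphabet_size) / n

S = system_up_to("01", 2, 12)
print(T("AGTCGTCGCT", 6, 2))  # the length-2 duplication from the Introduction
print([round(capacity_ratio(S, 2, n), 3) for n in (4, 8, 12)])
```

For the binary system of Example 1 the ratio equals (n − 2)/n, which approaches the capacity 1 slowly; this illustrates why exact capacity values are derived analytically rather than read off from small n.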


III. Capacity and Expressiveness 

In this section, we present our results on the capacity and expressiveness of tandem duplication systems with
bounded duplication length. The section is divided into two parts: the first part focuses on capacity and the second
on expressiveness.


A. Capacity 

Our first result is on the capacity of a tandem duplication string system over the ternary alphabet.

Fig. 2. Finite automaton for $S = (\{0,1,2\}, 012, \mathcal{T}^{tan}_{\le 3})$.

Theorem 1. For the tandem duplication string system $S = (\{0,1,2\}, 012, \mathcal{T}^{tan}_{\le 3})$, we have

$$\mathrm{cap}(S) = \log_3 \frac{3+\sqrt{5}}{2} \simeq 0.876036.$$

Proof: We prove this theorem by showing that the finite automaton given in Figure 2 accepts precisely the
strings in S, and then finding the capacity using the Perron-Frobenius theory.

The regular expression R for the language defined by this finite automaton is given by

$$R = (0^+1^+)^+ 2^+ (1^+2^+)^* \left[ 0^+ (2^+0^+)^* 1^+ (0^+1^+)^* 2^+ (1^+2^+)^* \right]^*. \quad (3)$$

Let $L_R$ be the language defined by the regular expression R (and by the finite automaton). We first show that
$L_R \subseteq S$. The direct way of doing so is to start with 012 and generate all the sequences in $L_R$ via duplications. For
simplicity of presentation, however, we take the reverse route: we show that every sequence in $L_R$ can be transformed
to 012 by a sequence of deduplications. A deduplication of length k is an operation that replaces a substring $\alpha\alpha$ by
$\alpha$ if $|\alpha| = k$. For two regular expressions $R_1$ and $R_2$, we use $R_1 \xrightarrow{dd \le k} R_2$ to denote that each sequence in $R_1$
can be transformed into some sequence in $R_2$ via a sequence of deduplications of length at most k.

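Note that $R = B_1 B_2^*$, where

$$B_1 = (0^+1^+)^+ 2^+ (1^+2^+)^*,$$

$$B_2 = 0^+ (2^+0^+)^* 1^+ (0^+1^+)^* 2^+ (1^+2^+)^*.$$

We have $B_1 \xrightarrow{dd \le 2} 012(12)^* \xrightarrow{dd \le 2} 012$, since $a^+ \xrightarrow{dd \le 1} a$ and $(ab)^+ \xrightarrow{dd \le 2} ab$ for all $a, b \in \Sigma$. Furthermore,

$$B_2 \xrightarrow{dd \le 1} 0(20)^*1(01)^*2(12)^* \xrightarrow{dd \le 2} 0(20)^*1(01)^*2 \xrightarrow{dd \le 2} 0(20)^*12 \xrightarrow{dd \le 2} \{02012, 012\}. \quad (4)$$

Note for example that $1(01)^*\underline{2}(12)^* \xrightarrow{dd \le 2} 1(01)^*2$ as the underlined 2 is always preceded by a 1.

We thus have $B_1 B_2 \xrightarrow{dd \le 2} \{01202012, 012012\} \xrightarrow{dd \le 3} 012$. Applying this repeatedly to $R = B_1 B_2^*$ gives
$R \xrightarrow{dd \le 3} 012$, proving that $L_R \subseteq S$.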
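The deduplication argument can be tested mechanically on short strings: a string belongs to S exactly when 012 can be reached from it by deduplications of length at most 3. The sketch below transcribes the regular expression (3) into Python's `re` syntax (our transcription) and checks every ternary string of length at most 8 that matches it; `dedup_once` and `reduces_to` are illustrative helper names.

```python
import re
from itertools import product

# Our transcription of (3): R = B1 B2*, with B1 and B2 as defined above.
R = re.compile(r"(0+1+)+2+(1+2+)*(0+(2+0+)*1+(0+1+)*2+(1+2+)*)*")

def dedup_once(x, k_max):
    """All strings obtained from x by replacing one square aa (|a| <= k_max) by a."""
    out = set()
    for k in range(1, k_max + 1):
        for i in range(len(x) - 2 * k + 1):
            if x[i:i + k] == x[i + k:i + 2 * k]:
                out.add(x[:i + k] + x[i + 2 * k:])
    return out

def reduces_to(x, target, k_max):
    """True if target is reachable from x by deduplications of length <= k_max."""
    seen, stack = {x}, [x]
    while stack:
        y = stack.pop()
        if y == target:
            return True
        for c in dedup_once(y, k_max):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return False

words = ["".join(w) for n in range(3, 9) for w in product("012", repeat=n)]
in_LR = [w for w in words if R.fullmatch(w)]
failures = [w for w in in_LR if not reduces_to(w, "012", 3)]
print(len(in_LR), failures)  # no failures are expected
```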
To complete the proof of $L_R = S$, we now show that $S \subseteq L_R$. In what follows, we say a finite automaton
generates a sequence s if there is a path with label s from Start to an accepting state. If an automaton generates
uvw, with $u, v, w \in \Sigma^*$, we may use v to refer both to the string v itself and to the part of the path that generates
v. The meaning will be clear from the context.

We show $S \subseteq L_R$ by proving the following for the finite automaton in Figure 2:

i) It can generate 012.

ii) If the automaton can generate pqr, with $p, q, r \in \Sigma^*$ and $|q| \le 3$, it can also generate $pq^2r$.

Condition i) holds trivially (see the path Start − $S_1$ − $S_2$ − $S_3$ in Figure 2). In order to prove ii), we define:

• Path Label: Given a path $\alpha$ in a finite automaton, the path label $l_\alpha \in \Sigma^*$ is defined as the sequence obtained
by concatenating the labels on the edges forming the path.

• Path Length is the number of edges of the path.

• Superstate: A state D is a superstate of a state C if for each path starting in C and ending in an accepting
state, there is a path with the same label starting in D and ending in an accepting state. Note that every state
is a superstate of itself.

• Duplicable Path: A path ending in a state C is duplicable if there is a path with the same label starting in C
and ending in a superstate of C.

Suppose a finite automaton can generate pqr. If q is duplicable, then $pq^2r$ can also be generated by the finite
automaton. As a result, to prove ii), it suffices to show that for each state C in Figure 2, all paths of length 1, 2
or 3 ending in C are duplicable.

The rest of the proof is divided into two parts. In Part 1, we show that all paths ending in $\{S_4, S_5, S_6, T_4, T_5, T_6\}$
with length $\le 3$ are duplicable. In Part 2, we prove the same statement for the states $\{S_1, S_2, S_3, T_2, T_3\}$. Note
that there are no nontrivial paths ending in the Start state.

Part 1: Given a state u and $j \in \{1, 2, 3\}$, let $P_j^u$ be the set of all length-j paths ending in u and let $Q_j^u$ be the
set of all length-j paths starting and ending in u. If

$$\bigcup_{\alpha \in P_j^u} \{l_\alpha\} \subseteq \bigcup_{\alpha \in Q_j^u} \{l_\alpha\}, \quad (5)$$

then all length-j paths ending in u are duplicable.

We prove that (5) holds for all states $\{S_4, S_5, S_6, T_4, T_5, T_6\}$ and all $j \in \{1,2,3\}$. This is done by computing $A_1$,
$A_1^2$ and $A_1^3$, where $A_1$ is the (labeled) adjacency matrix of the strongly connected component of the finite automaton
given in Figure 2, i.e. the subgraph induced by $\{S_4, S_5, S_6, T_4, T_5, T_6\}$. Here in computing the matrix products,
symbols do not commute, e.g. $xy \ne yx$. The adjacency matrix $A_1$ and its square $A_1^2$, where x, y and z represent
edges labeled by 0, 1, and 2, respectively, and where rows and columns correspond in order to $S_4, S_5, S_6, T_4, T_5, T_6$,
are given by

$$A_1 = \begin{bmatrix}
x & y & 0 & z & 0 & 0 \\
0 & y & z & 0 & x & 0 \\
x & 0 & z & 0 & 0 & y \\
x & 0 & 0 & z & 0 & 0 \\
0 & y & 0 & 0 & x & 0 \\
0 & 0 & z & 0 & 0 & y
\end{bmatrix},$$

$$A_1^2 = \begin{bmatrix}
x^2+zx & y^2+xy & yz & z^2+xz & yx & 0 \\
zx & y^2+xy & z^2+yz & 0 & x^2+yx & zy \\
x^2+zx & xy & z^2+yz & xz & 0 & y^2+zy \\
x^2+zx & xy & 0 & z^2+xz & 0 & 0 \\
0 & y^2+xy & yz & 0 & x^2+yx & 0 \\
zx & 0 & z^2+yz & 0 & 0 & y^2+zy
\end{bmatrix}.$$




Each entry in these matrices lists the paths of specific length from the state identified by its row to the state identified
by its column. For example, the entry (6, 3) of $A_1^2$, which equals $z^2 + yz$, indicates that there are two paths of
length 2 from $T_6$ to $S_6$ with labels $z^2 = 22$ and $yz = 12$.

For a state $u \in \{S_4, S_5, S_6, T_4, T_5, T_6\}$, the terms in the column that corresponds to u in these matrices represent
the labels of the paths of the appropriate length that start in $S_4, S_5, S_6, T_4, T_5$, or $T_6$ and end in u. Furthermore, for
every path that starts in $\{S_1, S_2, S_3, T_2, T_3\}$ and ends in u, there is a corresponding path with the same label that
starts in $\{S_4, S_5, S_6, T_4, T_5, T_6\}$ and ends in u; this path can be obtained by replacing $S_1$ with $S_4$, $S_2$ with $S_5$, $S_3$
with $S_6$, $T_2$ with $T_5$ and $T_3$ with $T_6$. Finally, there are no paths of length at most 3 from Start to u. Hence, the
terms in the column corresponding to u in the matrix $A_1^i$, $i \in \{1, 2, 3\}$, contain the labels for all paths of length
i that end in u. On the other hand, the terms in the diagonal element in this column correspond to labels of the
paths that start and end in u.

It thus follows that to check (5), we need to verify that the nonzero terms in the non-diagonal elements of each
column also appear in its diagonal element. For $A_1$ and $A_1^2$, this can be easily done by observing the matrices. For
example, the entry (3, 3) of $A_1^2$ equals $z^2 + yz$ and contains all terms appearing in column 3 of $A_1^2$, which are yz
and $z^2 + yz$. We verified using a computer that $A_1^3$ also satisfies the same condition. Hence, we have shown that
all paths of length at most 3 ending in $\{S_4, S_5, S_6, T_4, T_5, T_6\}$ are duplicable.
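The computer verification mentioned above is easy to reproduce. In the following sketch, each matrix entry is stored as a set of path labels, so the product of two entries is the set of concatenations and symbols do not commute. The edge list is read off $A_1$ with x, y, z written as 0, 1, 2; the function names are our own and this is not the authors' program.

```python
# Rows/columns: S4, S5, S6, T4, T5, T6; "" means no edge.
EDGES = [
    ["0", "1", "",  "2", "",  ""],
    ["",  "1", "2", "",  "0", ""],
    ["0", "",  "2", "",  "",  "1"],
    ["0", "",  "",  "2", "",  ""],
    ["",  "1", "",  "",  "0", ""],
    ["",  "",  "2", "",  "",  "1"],
]

def to_sets(edges):
    return [[{e} if e else set() for e in row] for row in edges]

def multiply(P, Q):
    """Noncommutative product: entry (i, j) holds all concatenated labels of paths i -> k -> j."""
    n = len(P)
    return [[{a + b for k in range(n) for a in P[i][k] for b in Q[k][j]}
             for j in range(n)] for i in range(n)]

def condition_holds(M):
    """Condition (5): every label in column u also appears in the diagonal entry (u, u)."""
    n = len(M)
    return all(M[i][u] <= M[u][u] for u in range(n) for i in range(n))

A = to_sets(EDGES)
power, results = A, []
for length in (1, 2, 3):
    results.append(condition_holds(power))
    power = multiply(power, A)
print(results)  # [True, True, True]
```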

Part 2: Now, we prove that all paths of length at most 3 ending in $\{S_1, S_2, S_3, T_2, T_3\}$ are duplicable. We first
show that (5) holds for all states $u \in \{S_1, S_2, T_2, T_3\}$ for paths of length $\le 3$, and also holds for $S_3$ for paths of
length 1 and 2. Next, we show that while (5) does not hold for paths of length 3 for $S_3$, all length-3 paths ending
in $S_3$ are still duplicable.

Observe that there is no path of any length from any state in $\{S_4, S_5, S_6, T_4, T_5, T_6\}$ to any state in $\{\text{Start}, S_1, S_2, S_3, T_2, T_3\}$,
hence we only need the (labeled) adjacency matrix $A_2$ of the subgraph induced by $\{\text{Start}, S_1, S_2, S_3, T_2, T_3\}$. We
have

$$A_2 = \begin{bmatrix}
0 & x & 0 & 0 & 0 & 0 \\
0 & x & y & 0 & 0 & 0 \\
0 & 0 & y & z & x & 0 \\
0 & 0 & 0 & z & 0 & y \\
0 & 0 & y & 0 & x & 0 \\
0 & 0 & 0 & z & 0 & y
\end{bmatrix},$$

$$A_2^2 = \begin{bmatrix}
0 & x^2 & xy & 0 & 0 & 0 \\
0 & x^2 & y^2+xy & yz & yx & 0 \\
0 & 0 & y^2+xy & z^2+yz & x^2+yx & zy \\
0 & 0 & 0 & z^2+yz & 0 & y^2+zy \\
0 & 0 & y^2+xy & yz & x^2+yx & 0 \\
0 & 0 & 0 & z^2+yz & 0 & y^2+zy
\end{bmatrix},$$

where rows and columns correspond to Start, $S_1, S_2, S_3, T_2, T_3$, in that order. We observe that in $A_2$ and
$A_2^2$, in each of the columns corresponding to $S_1, S_2, S_3, T_2$, and $T_3$, the terms in the diagonal entry contain
the terms appearing in that column, implying that (5) holds for all $u \in \{S_1, S_2, S_3, T_2, T_3\}$ and $j \in \{1,2\}$, i.e.
for paths of length 1 and 2. By computing $A_2^3$ using a computer, it can be checked that (5) holds for all states
$u \in \{S_1, S_2, T_2, T_3\}$ for paths of length 3 as well.

For $S_3$, there is a length-3 path ending in $S_3$ with label 012 (for example Start − $S_1$ − $S_2$ − $S_3$), for which there does not exist a corresponding
path with the same label which starts and ends in $S_3$. Due to this fact, (5) does not hold for $S_3$ for paths of length
3. But for this length-3 path, we can traverse $S_3 − S_4 − S_5 − S_6$, which also has label 012. Now, since $S_6$ is a
superstate of $S_3$, the path with label 012 ending in $S_3$ is duplicable. The other length-3 paths ending in $S_3$
are 112, 122, 222 and 212. For each of these 4 paths, there exists a corresponding path with the same label that
starts and ends in $S_3$ (see Figure 2). Hence, all length-3 paths ending in $S_3$ are duplicable. This completes the proof
of $S \subseteq L_R$.
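The exceptional role of $S_3$ in Part 2 can be made visible by listing, for each state and each length up to 3, the path labels that end in the state but do not label any closed path at that state. The sketch below uses the edges of $A_2$ with x, y, z written as 0, 1, 2; the names are ours, and only the induced subgraph on Start, $S_1, S_2, S_3, T_2, T_3$ is considered.

```python
STATES = ["Start", "S1", "S2", "S3", "T2", "T3"]
EDGES = [  # rows: from-state, columns: to-state; "" means no edge
    ["", "0", "",  "",  "",  ""],
    ["", "0", "1", "",  "",  ""],
    ["", "",  "1", "2", "0", ""],
    ["", "",  "",  "2", "",  "1"],
    ["", "",  "1", "",  "0", ""],
    ["", "",  "",  "2", "",  "1"],
]

def multiply(P, Q):
    """Noncommutative product of label-set matrices."""
    n = len(P)
    return [[{a + b for k in range(n) for a in P[i][k] for b in Q[k][j]}
             for j in range(n)] for i in range(n)]

def missing_labels(max_len):
    """Labels of paths ending in u that are not labels of closed paths at u, per (state, length)."""
    n = len(EDGES)
    A = [[{e} if e else set() for e in row] for row in EDGES]
    power, missing = A, {}
    for length in range(1, max_len + 1):
        for u in range(1, n):  # Start has no incoming edges, so it is skipped
            column = set().union(*(power[i][u] for i in range(n)))
            gap = column - power[u][u]
            if gap:
                missing[(STATES[u], length)] = gap
        power = multiply(power, A)
    return missing

print(missing_labels(3))  # {('S3', 3): {'012'}}
```

The only failure of condition (5) is the label 012 at $S_3$, which is exactly the case handled separately through the superstate $S_6$.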

Now that we have shown $S = L_R$, we use the Perron-Frobenius theory to count the number of sequences
which can be generated via this deterministic finite automaton. We calculate the maximum absolute eigenvalue $e^*$
of the (unlabeled) adjacency matrix B of the strongly connected component of the finite automaton in Figure 2
(i.e. the subgraph induced by $S_4, S_5, S_6, T_4, T_5, T_6$). The matrix B can be obtained by replacing x, y, and z in $A_1$
by 1,

$$B = \begin{bmatrix}
1 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}.$$

The maximum absolute eigenvalue of B is $e^* = \frac{3+\sqrt{5}}{2} \simeq 2.618034$. By the Perron-Frobenius theory, $\mathrm{cap}(S) =
\log_3 e^* \simeq 0.876036$. ■
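The eigenvalue and the resulting capacity can be reproduced numerically. The sketch below uses plain power iteration in pure Python, which is one of several possible methods and not necessarily the one used by the authors; it relies on B being primitive (strongly connected with self-loops), and the function name is ours.

```python
import math

# Unlabeled adjacency matrix B (rows/columns: S4, S5, S6, T4, T5, T6).
B = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]

def spectral_radius(M, iterations=200):
    """Power iteration for the largest eigenvalue of a primitive nonnegative matrix."""
    v = [1.0] * len(M)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in M]
        lam = max(w)              # v is normalized so that its largest entry is 1
        v = [c / lam for c in w]
    return lam

e_star = spectral_radius(B)
print(e_star, (3 + math.sqrt(5)) / 2)  # both approximately 2.618034
print(math.log(e_star, 3))             # approximately 0.876036
```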

While the proof of the preceding theorem providing the exact capacity of the system under study is somewhat
involved, it is easy to see why the capacity is strictly less than 1. One can observe from the regular expression for
the finite automaton that it cannot generate a string which has 210, 021 or 102 as a substring, implying that the
system is not fully expressive. As we will see in Lemma 4, such systems cannot have capacity 1. It is worth noting
that the set of strings that avoid 210, 021, and 102 can be shown to have capacity $\simeq 0.914838$, which is slightly
larger than the capacity of the system of the theorem.
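The value 0.914838 for the strings avoiding 210, 021, and 102 can be recovered with a standard transfer-matrix computation. In the sketch below, states are pairs of consecutive symbols and a transition ab → bc is allowed unless abc is forbidden. This is our own illustration of how such a number can be obtained, not a description of the authors' calculation.

```python
import math
from itertools import product

FORBIDDEN = {"210", "021", "102"}
pairs = ["".join(p) for p in product("012", repeat=2)]

# transfer[i][j] = 1 if pair i = ab may be followed by pair j = bc, i.e. abc is not forbidden
transfer = [[1 if p[1] == q[0] and p + q[1] not in FORBIDDEN else 0 for q in pairs]
            for p in pairs]

def spectral_radius(M, iterations=200):
    """Power iteration for the largest eigenvalue of a primitive nonnegative matrix."""
    v = [1.0] * len(M)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in M]
        lam = max(w)
        v = [c / lam for c in w]
    return lam

lam = spectral_radius(transfer)
print(lam, math.log(lam, 3))  # approximately 2.732051 and 0.914838
```

The growth rate comes out as $1 + \sqrt{3}$, so the capacity of the avoidance set is $\log_3(1+\sqrt{3})$, consistent with the value quoted above.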

B. Expressiveness 

We now turn to study the expressiveness of tandem duplication systems with bounded duplication length. For 
completeness we start with binary systems, which is indeed the simplest case. 

Lemma 3. The system $S = (\{0,1\}, s, \mathcal{T}^{tan}_{\le 1})$ for any s is not fully expressive.

Proof: The system cannot generate $(01)^m$ as a substring of any string in S for $2m > |s|$. ■

As shown in Example 1, to obtain fully expressive binary systems, it suffices to increase the maximum duplication
length to 2.

The next theorem is concerned with the expressiveness of $S = (\{0,1,2\}, s, \mathcal{T}^{tan}_{\le 3})$. Larger alphabets and larger
duplication lengths are considered in Theorems 3 and 4.

Theorem 2. Consider $S = (\{0,1,2\}, s, \mathcal{T}^{tan}_{\le 3})$, where s is an arbitrary starting string $s \in \{0,1,2\}^*$. Then, S is
not fully expressive.

Proof: A k-irreducible string is a string that does not have a tandem repeat $\alpha\alpha$ such that $|\alpha| \le k$. For example,
01201, 01210, 02101, and 01210121 are 3-irreducible strings, while 01212, 021021 and 01112 are not 3-irreducible.
To prove the theorem, we identify certain properties of new 3-irreducible strings that may appear after a duplication
and then construct a 3-irreducible string that is neither a substring of s nor satisfies the properties that every
new 3-irreducible substring must satisfy.

Consider a duplication event that transforms a sequence z = uvw to $z^* = uvvw$, where $|v| \le 3$. Let x be a
3-irreducible string of length at least 4 that is present in $z^*$ but not in z. The string x must intersect with both
copies of v in $z^*$ or else it is also present in z. Furthermore, it cannot contain vv, since otherwise it would not
be 3-irreducible. To determine the properties of x, we consider three cases: $|v| = 1, 2, 3$. In what follows assume
$a_1, a_2, a_3 \in \Sigma$.

First, suppose $|v| = 1$, say $v = a_1$. In this case, a string x with the aforementioned properties does not exist as
all new substrings contain the square $a_1 a_1$.

Second, assume $|v| = 2$, say $v = a_1 a_2$. Then $z^* = u a_1 a_2 a_1 a_2 w$ and x either ends with $a_1 a_2 a_1$ or starts with
$a_2 a_1 a_2$.

Third, suppose $|v| = 3$, say $v = a_1 a_2 a_3$. So $z^* = u a_1 a_2 a_3 a_1 a_2 a_3 w$. Recall that $|x| \ge 4$. The string x either
ends with $a_1 a_2 a_3 a_1$ or $a_2 a_3 a_1 a_2$, or starts with $a_2 a_3 a_1 a_2$ or $a_3 a_1 a_2 a_3$.

So for any new 3-irreducible substring $x = x_1 \cdots x_j$, $x_i \in \Sigma$, $j \ge 4$, we have $x_1 = x_3$, $x_1 = x_4$, $x_j = x_{j-2}$, or
$x_j = x_{j-3}$. Now consider the string $(0121)^l 0$, where $l > |s|$. This sequence is 3-irreducible but does not satisfy
any of the 4 properties stated for x. Since it is not a substring of s and it cannot be generated as a new substring,
it is not a substring of any $y \in S$. ■
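The two facts used about the witness string, namely that it is 3-irreducible and that it violates all four boundary conditions, are simple to check in code. The sketch below also re-checks the examples of 3-irreducible strings given at the start of the proof; the function names and the choice l = 5 are ours.

```python
def is_irreducible(x, k):
    """True if x contains no tandem repeat aa with 1 <= |a| <= k."""
    for m in range(1, k + 1):
        for i in range(len(x) - 2 * m + 1):
            if x[i:i + m] == x[i + m:i + 2 * m]:
                return False
    return True

def has_new_substring_property(x):
    """One of x1 = x3, x1 = x4, xj = x(j-2), xj = x(j-3) holds (requires len(x) >= 4)."""
    return x[0] == x[2] or x[0] == x[3] or x[-1] == x[-3] or x[-1] == x[-4]

irreducible_examples = ["01201", "01210", "02101", "01210121"]
reducible_examples = ["01212", "021021", "01112"]
print([is_irreducible(x, 3) for x in irreducible_examples])  # all True
print([is_irreducible(x, 3) for x in reducible_examples])    # all False

witness = "0121" * 5 + "0"  # (0121)^l 0 with l = 5
print(is_irreducible(witness, 3), has_new_substring_property(witness))  # True False
```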

Next we consider the system $S = (\Sigma, s, \mathcal{T}^{tan}_{\le k})$ with $|\Sigma| \ge 4$ in Theorem 3. The proof of the theorem uses the following
lemma, which states that the expressiveness of a system also has a bearing on its capacity.

Lemma 4. If a string system S with alphabet $\Sigma$ is not fully expressive, then $\mathrm{cap}(S) < 1$.

Proof: Since S is not fully expressive, there exists a $z \in \Sigma^*$ that does not appear as a substring of any $y \in S$.
Let $|z| = m$ and $p = n - m\lfloor n/m \rfloor$. We have

$$|S \cap \Sigma^n| \le (|\Sigma|^m - 1)^{\lfloor n/m \rfloor} |\Sigma|^p.$$

Since m is finite, $\mathrm{cap}(S) < 1$. ■
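The last step of the proof of Lemma 4 can be spelled out as follows. This is our own expansion of the argument: each string of length n is split into $\lfloor n/m \rfloor$ blocks of length m, none of which may equal z, followed by $p < m$ remaining symbols.

```latex
\begin{align*}
\mathrm{cap}(S)
  &= \limsup_{n\to\infty} \frac{\log_{|\Sigma|} |S \cap \Sigma^n|}{n} \\
  &\le \limsup_{n\to\infty}
     \frac{\lfloor n/m \rfloor \log_{|\Sigma|}\!\left(|\Sigma|^m - 1\right) + p}{n} \\
  &= \frac{\log_{|\Sigma|}\!\left(|\Sigma|^m - 1\right)}{m}
   \;<\; \frac{\log_{|\Sigma|} |\Sigma|^m}{m} \;=\; 1,
\end{align*}
% using p < m (so p/n -> 0) and floor(n/m)/n -> 1/m as n -> infinity.
```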

Theorem 3. Consider $S = (\Sigma, s, \mathcal{T}^{tan}_{\le k})$, where $|\Sigma| \ge 4$, s is an arbitrary seed in $\Sigma^*$ and k is some finite natural
number. Then S is not fully expressive, which also implies $\mathrm{cap}(S) < 1$.

Proof: Suppose $z = uvw \in S$, where $|v| \le k$, and let $z^* = uvvw$ be the result of a duplication applied to
z. Furthermore, suppose that $x = x_1 \cdots x_j$, where $x_i \in \Sigma$ and $j > k$, is a square-free substring of $z^*$ but not
z. Similar to the proof of Theorem 2, x intersects both copies of v but does not contain both. As a result, either
$x_1 = x_{i+1}$ or $x_j = x_{j-i}$, for some $2 \le i \le k$.

For definiteness assume $\Sigma$ contains the symbols {0, 1, 2, 3}. The sequence 0t0, where t is a square-free sequence
over the alphabet {1, 2, 3} and $|t| > \max\{|s|, k\}$, is not a substring of s and cannot be generated as a substring
since it does not satisfy the conditions stated for x above. Note that such a t exists since, as shown by Thue,
for an alphabet size $\ge 3$, there exists a square-free string of any length. Hence S is not fully expressive. The second
part of the theorem follows from Lemma 4. ■
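A witness string of the form 0t0 can be constructed explicitly. The sketch below obtains t as a prefix of the fixed point of the substitution 1 → 123, 2 → 13, 3 → 2, a standard square-free ternary word; this is one well-known construction, and we do not claim it is the one in Thue's paper. The square-free property is re-checked by brute force, and the seed length and bound k are hypothetical values chosen for illustration.

```python
def is_square_free(x):
    """True if x contains no substring of the form aa with a nonempty."""
    n = len(x)
    for m in range(1, n // 2 + 1):
        for i in range(n - 2 * m + 1):
            if x[i:i + m] == x[i + m:i + 2 * m]:
                return False
    return True

def ternary_square_free(length):
    """Prefix of the fixed point of 1 -> 123, 2 -> 13, 3 -> 2 over the alphabet {1, 2, 3}."""
    rules = {"1": "123", "2": "13", "3": "2"}
    word = "1"
    while len(word) < length:
        word = "".join(rules[c] for c in word)
    return word[:length]

def violates_conditions(x, k):
    """True if x1 != x(i+1) and xj != x(j-i) for every 2 <= i <= k."""
    return all(x[0] != x[i] and x[-1] != x[-1 - i] for i in range(2, k + 1))

seed_length, k = 20, 10          # hypothetical |s| and duplication bound
t = ternary_square_free(max(seed_length, k) + 1)
witness = "0" + t + "0"
print(witness)
print(is_square_free(witness), violates_conditions(witness, k))  # True True
```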

Theorem 4. Consider $S = (\{0,1,2\}, 012, \mathcal{T}^{tan}_{\le 4})$. Then S is a fully expressive string system.

Proof: Let $S' = (\{0,1,2\}, 012, \mathcal{T}^{tan}_{\le 3})$. Clearly, $S' \subseteq S$. From the proof of Theorem 1, we know that the
automaton of Figure 2 gives the same language as S'. By checking this automaton, we find that all strings of
lengths 1, 2, and 3, except 021, 210, and 102, appear as a substring of some string in S' and, as a result, some
string in S. To generate 210, 021, and 102 as substrings of some string in S, we proceed as follows:

$012 \to 0\underline{1212} \to \underline{01210121}2$
$012 \to \underline{012012} \to 01\underline{2020}12 \to 0\underline{12021202}012$
$012 \to \underline{012012} \to 01\underline{2020}12 \to 012\underline{02010201}2$

where the repeats are underlined.
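Each step of the three derivations above can be validated automatically: consecutive strings must differ by a single tandem duplication of length at most 4, and the final string must contain the target substring. The sketch below writes the derivations without underlining; the helper name `is_tandem_step` is ours.

```python
def is_tandem_step(x, y, k_max):
    """True if y arises from x by one tandem duplication of length <= k_max."""
    k = len(y) - len(x)
    if not 1 <= k <= k_max:
        return False
    return any(x[:i + k] + x[i:i + k] + x[i + k:] == y
               for i in range(len(x) - k + 1))

# target substring -> derivation starting from the seed 012
chains = {
    "210": ["012", "01212", "012101212"],
    "021": ["012", "012012", "01202012", "012021202012"],
    "102": ["012", "012012", "01202012", "012020102012"],
}

for target, chain in chains.items():
    steps_ok = all(is_tandem_step(a, b, 4) for a, b in zip(chain, chain[1:]))
    print(target, steps_ok, target in chain[-1])  # each line should end with True True
```

Every chain uses one duplication of length 4 in its last step, which is consistent with the fact that length-3 duplications alone cannot produce these substrings.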

We have shown that all strings of length 3 appear in S as substrings. Now we show the same for every string
$w = w_1 w_2 w_3 w_4$ of length 4. To do so, we study 3 cases based on the structure of w:

I) First, suppose that $w_4$ is the same as $w_1$, $w_2$, or $w_3$. For generating such w as a substring, we first generate
$w' = w_1 w_2 w_3$ as a substring of some string and then do a tandem duplication of $w_3$ if $w_4 = w_3$, of $w_2 w_3$ if
$w_4 = w_2$ and of $w_1 w_2 w_3$ if $w_4 = w_1$.

II) Suppose I) does not hold but $w_1 = w_2$ or $w_2 = w_3$. If the former holds, first generate $w_1 w_3 w_4$ and then
duplicate $w_1$, and if the latter holds, generate $w_1 w_2 w_4$ and duplicate $w_2$.

III) If neither I) nor II) holds, then w = 1210, up to a relabeling of the symbols. In this case, we first generate
w' = 0121 and then do a tandem duplication of w' to get w. Note that w' is of the type considered in I).

Σ           s           k           Fully expressive    Reason
---------------------------------------------------------------------
{0}         0           ≥ 1         Yes                 Trivial
{0,1}       arbitrary   1           No                  Lemma 3
{0,1}       01          ≥ 2         Yes                 Example
{0,1,2}     arbitrary   ≤ 3         No                  Theorem 2
{0,1,2}     012         ≥ 4         Yes                 Theorem 4
|Σ| ≥ 4     arbitrary   arbitrary   No                  Theorem

TABLE III 

Expressiveness of tandem duplication string systems $(\Sigma, s, \mathcal{T}^{tan}_{\le k})$ 


Until now, we have shown that all strings $w$ of length at most 4 appear as a substring of some string in $S$. We use induction to complete the proof. Suppose all strings of length at most $m$ appear as a substring of some string in $S$, where $m \ge 4$. We show that the same holds for strings of length $m + 1$. 

Consider an arbitrary $w = a_1 a_2 \cdots a_m a_{m+1}$. We now consider two cases: 

i) If all three letters in the alphabet occur at least once in $a_{m-3} a_{m-2} a_{m-1} a_m$, then $a_{m+1}$ equals $a_{m-3}$, $a_{m-2}$, $a_{m-1}$, or $a_m$, and $w$ can be generated as a substring by a tandem duplication of some suffix of size at most 4 of $w' = a_1 a_2 \cdots a_m$. Note that by the induction hypothesis $w'$ can be generated as a substring of some string. 

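ii) If at least one letter in the alphabet does not occur in $a_{m-3} a_{m-2} a_{m-1} a_m$, then $a_{m-3} a_{m-2} a_{m-1} a_m$ is a sequence over a binary alphabet and so it has a tandem repeat. Removing one copy of this repeat from $w$ gives a string of length at most $m$, which by the induction hypothesis appears as a substring of some string in $S$. Therefore $w$ can be generated as a substring by a tandem duplication of length at most 2. Hence, we have proved the theorem. ■ 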
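
The base case of the proof of Theorem 4 for strings of length 3 can be checked by brute force. The following is a minimal sketch, not part of the original proof: the function names `tandem_successors`, `generate`, and `missing_substrings` and the length bounds are illustrative assumptions. It enumerates every string of bounded length reachable from the seed 012 and reports which ternary strings of length 3 never occur as substrings.

```python
from itertools import product


def tandem_successors(x, k):
    """Strings obtained from x by one tandem duplication of length at most k."""
    out = set()
    for length in range(1, k + 1):
        for i in range(len(x) - length + 1):
            # Insert a second copy of x[i:i+length] right after the first one.
            out.add(x[: i + length] + x[i : i + length] + x[i + length :])
    return out


def generate(seed, k, max_len):
    """All strings of length <= max_len reachable from seed (breadth-first)."""
    seen, frontier = {seed}, [seed]
    while frontier:
        new = []
        for x in frontier:
            for y in tandem_successors(x, k):
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


def missing_substrings(seed, k, n, max_len, alphabet="012"):
    """Length-n strings that occur in no generated string of length <= max_len."""
    found = {
        x[i : i + n]
        for x in generate(seed, k, max_len)
        for i in range(len(x) - n + 1)
    }
    return {"".join(w) for w in product(alphabet, repeat=n)} - found


print(sorted(missing_substrings("012", 3, 3, 9)))   # duplication length <= 3
print(sorted(missing_substrings("012", 4, 3, 12)))  # duplication length <= 4
```

Because duplications only lengthen a string, truncating the search at `max_len` still finds every string of the system up to that length. With duplication length at most 3, the strings 021, 102, and 210 are the ones reported missing; with duplication length at most 4 and strings up to length 12, nothing is missing, in line with the three derivations given in the proof.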

Table III summarizes the results of this subsection. It can be observed from the table that a change of behavior in expressiveness occurs when the size of the alphabet increases to 4. If the size of the alphabet is 1, 2, or 3, then for sufficiently large maximum duplication length the systems are fully expressive. However, if the size of the alphabet is at least 4, then regardless of the maximum duplication length, the system is not fully expressive. This change is related to the fact that for alphabets of size 1 and 2, square-free strings have bounded length, but for alphabets of size 3 and larger, there are square-free strings of any length. Specifically, in case ii) in the proof of Theorem 4 we used the fact that the binary string $a_{m-3} a_{m-2} a_{m-1} a_m$ has a tandem repeat. To adapt this proof for $|\Sigma| \ge 4$, we would need to show that the $(|\Sigma| - 1)$-ary string $a_{m-3} a_{m-2} a_{m-1} a_m$ has a tandem repeat. But this is not true in general, since there are square-free strings over alphabets of size at least 3 per Thue's result. Indeed, we showed above, again using Thue's result, that the system $(\Sigma, s, \mathcal{T}^{tan}_{\le k})$ is not fully expressive for $|\Sigma| \ge 4$ and any $k$. 
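
The two facts about square-free strings used in this discussion are easy to confirm computationally. The sketch below is an illustration only; `has_square` and `square_free_word` are hypothetical helper names, and the backtracking search is a generic method, not Thue's construction.

```python
from itertools import product


def has_square(w):
    """True if w contains a tandem repeat xx with x non-empty."""
    n = len(w)
    return any(
        w[i : i + l] == w[i + l : i + 2 * l]
        for l in range(1, n // 2 + 1)
        for i in range(n - 2 * l + 1)
    )


def square_free_word(alphabet, length, prefix=""):
    """Backtracking search for a square-free word; returns None if none exists."""
    if len(prefix) == length:
        return prefix
    for c in alphabet:
        if not has_square(prefix + c):
            word = square_free_word(alphabet, length, prefix + c)
            if word is not None:
                return word
    return None


# Every binary string of length 4 contains a tandem repeat ...
print(all(has_square("".join(w)) for w in product("01", repeat=4)))
# ... while square-free ternary strings of length 30 exist.
print(square_free_word("012", 30))
```

The first check is the fact used in case ii) of the proof of Theorem 4; the second illustrates why the same argument cannot be carried over to a sub-alphabet of size 3.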


IV. Regular Languages for Tandem Duplication String Systems 
Regular languages for tandem duplication string systems are easier to study because one can use tools from Perron-Frobenius theory [5] to calculate capacity. It has been proved previously that for $|\Sigma| \ge 3$ and maximum duplication length at least 4, the language defined by a tandem duplication string system is not regular if the seed contains $abc$ as a substring such that $a$, $b$ and $c$ are distinct. However, if the maximum duplication length is 3, this question was left unanswered. In Theorem 5 we show that the language resulting from a tandem duplication system with maximum duplication length 3 is regular regardless of the alphabet size and seed. Further, in Corollary 5 we characterize the exact capacity of such tandem duplication string systems. 

Theorem 5. Let $S = (\Sigma, s, \mathcal{T}^{tan}_{\le 3})$, where $\Sigma$ and $s$ are arbitrary. The language defined by $S$ is regular. 

Proof: We first assume that $s = a_1 \cdots a_m$, where the $a_i$ are distinct. The case in which the $a_i$ are not distinct is handled later. 

Σ           s                                                              k      Capacity
------------------------------------------------------------------------------------------------------
{0,1}       01                                                             1      0
{0,1}       01                                                             ≥ 2    1
arbitrary   arbitrary but not a^m for some a ∈ Σ                           2      log_{|Σ|} 2
{0,1,2}     012                                                            3      log_3 ((3+√5)/2)
arbitrary   xabcy (x, y ∈ Σ*, a, b, c ∈ Σ and a ≠ b ≠ c ≠ a)               3      log_{|Σ|} ((3+√5)/2)
arbitrary   no 3 consecutive symbols all distinct and s ≠ a^m for a ∈ Σ    3      log_{|Σ|} 2

TABLE IV 

Capacity values for different tandem duplication string systems $(\Sigma, s, \mathcal{T}^{tan}_{\le k})$ 


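For $3 \le j \le m$, let 

\[ R_{a_1 \cdots a_j} = a_1^+ a_2^+ (a_1^+ a_2^+)^* \prod_{i=3}^{j} a_i^+ (a_{i-1}^+ a_i^+)^* B^*_{a_{i-2} a_{i-1} a_i}, \]

where the product denotes concatenation in increasing order of $i$ and, for $a, b, c \in \Sigma$, 

\[ B_{abc} = a^+ (c^+ a^+)^* \, b^+ (a^+ b^+)^* \, c^+ (b^+ c^+)^*. \]

We already know that $S = (\Sigma, s, \mathcal{T}^{tan}_{\le 3})$ with $s = a_1 \cdots a_m$ is a regular language if $m = 3$. We show that for $m \ge 4$, $S$ represents a regular language whose regular expression is given by $R_{a_1 a_2 \cdots a_m}$. Let $L_R$ be the language defined by $R_{a_1 a_2 \cdots a_m}$. It suffices to show $L_R = S$. 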

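We first show that $L_R \subseteq S$ by proving $R_{a_1 a_2 \cdots a_m} \xrightarrow{dd_{\le 3}} s$. To do so, we show by induction that $R_{a_1 a_2 \cdots a_i} \xrightarrow{dd_{\le 3}} a_1 a_2 \cdots a_i$. First note that this holds for $i = 3$, from the proof of the case $m = 3$. Assuming that it holds for $i$, we show that it also holds for $i + 1$, where $i \ge 3$. We write 

\begin{align*}
R_{a_1 \cdots a_{i+1}} &= R_{a_1 \cdots a_i} \, a_{i+1}^+ (a_i^+ a_{i+1}^+)^* B^*_{a_{i-1} a_i a_{i+1}} \\
&\xrightarrow{dd_{\le 3}} a_1 a_2 \cdots a_i a_{i+1} (a_i a_{i+1})^* B^*_{a_{i-1} a_i a_{i+1}} \\
&\xrightarrow{dd_{\le 3}} a_1 a_2 \cdots a_i a_{i+1} (a_{i-1} a_i a_{i+1}) \ \text{or} \ a_1 a_2 \cdots a_i a_{i+1} (a_{i-1} a_{i+1} a_{i-1} a_i a_{i+1}) \\
&\xrightarrow{dd_{\le 3}} a_1 a_2 \cdots a_i a_{i+1}.
\end{align*}
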
Here we have used the fact that $c B_{abc} \xrightarrow{dd_{\le 3}} cabc$. Hence, $L_R \subseteq S$. 

We now show that $S \subseteq L_R$. Note that the seed $s$ is in $L_R$. It thus suffices to show that if $x = pqr \in L_R$, then $y = pq^2r \in L_R$, where $p, q, r \in \Sigma^*$ and $|q| \le 3$. We have the following five cases: 

1) $q = b$, $q = bb$ or $q = bbb$, for some $b \in \Sigma$: Since each symbol in the regular expression $R_{a_1 \cdots a_m}$ is followed by a $+$ or $*$ as a superscript, if $q$ represents a run and $pqr \in L_R$, then $pq^2r \in L_R$ as well. 

2) $q = bc$ for distinct $b, c \in \Sigma$: Here $q$ represents a length-2 path in the finite automaton for a regular expression of the form $(b^+c^+)^*$, $b^+(c^+b^+)^*$, $b^+B_{cab}$, $B_{abc}$, $B_{bca}$, $B_{bac}$, $B_{cab}$, $B_{acb}$ or $b^+c^+$. We know from the proof of Theorem 1 that $bc$ is duplicable in $(b^+c^+)^*$, $B_{abc}$, $B_{bca}$, $B_{bac}$, $B_{cab}$ and $B_{acb}$. For $b^+(c^+b^+)^*$ and $b^+B_{cab}$, we enter a state in the finite automaton for $(c^+b^+)^*$ and $B_{cab}$, respectively, with incoming edge labeled by $c$. In this state, we can again duplicate the path $bc$ and return to the same state. 

The finite automaton for $b^+c^+$ is followed by the finite automaton for $(b^+c^+)^*$, so $bc$ can be duplicated in the automaton for $(b^+c^+)^*$. The duplicate $q = bc$ generated here in $(b^+c^+)^*$ ends in some state $C$ which is a superstate of the state $D$ in which the original $q$ in $pqr$ ended. Since $C$ is a superstate of $D$, $r$ can also be generated from $C$. Hence $pq^2r \in L_R$. 

3) $q = bbc$ or $bcc$ for distinct $b, c \in \Sigma$: Here $q$ represents a length-3 path. We only consider $q = bbc$; the other case is similar. If $pbbcr \in L_R$, then $pbcr \in L_R$ as well, since every symbol in $R_{a_1 \cdots a_m}$ is followed by a $+$ or $*$ as a superscript. We already know from case 2 above that if $pbcr$ can be generated then $pbcbcr$ can also be generated. From case 1 above, we also know that if $pbcbcr$ can be generated then $pbbcbcr$ can also be generated. Using case 1 again, we can generate $pbbcbbcr$ from $pbbcbcr$. Hence $pq^2r \in L_R$. 

4) $q = abc$ for distinct $a, b, c \in \Sigma$: Here $q$ represents a length-3 path in the finite automaton for $B_{\sigma(abc)}$ ($\sigma(abc)$ represents any permutation of $a, b, c$), $a^+(b^+c^+)^*$, $a^+B_{bca}$, $(a^+b^+)^*c^+$, $B_{xab}c^+$, $B_{cab}$ or $a^+b^+c^+$. 

We know from the proof of Theorem 1 that $abc$ is duplicable in $B_{\sigma(abc)}$. The same reasoning holds for $a^+B_{bca}$ and $(a^+b^+)^*B_{cab}$. 

The finite automaton for $a^+(b^+c^+)^*$, $(a^+b^+)^*c^+$, $B_{xab}c^+$ and $a^+b^+c^+$ is followed by a finite automaton for $B_{abc}$, so $q$ can be duplicated in the finite automaton for $B_{abc}$. The duplicate $q$ ends in some state $E$ which is a superstate of the state $F$ in which the original $q$ in $pqr$ ended. Since $E$ is a superstate of $F$, $r$ can also be generated from $E$. Hence $pq^2r \in L_R$. 

5) $q = cbc$ for distinct $b, c \in \Sigma$: Here $q$ represents a length-3 path that can be generated by the finite automaton for $(c^+b^+)^*$, $(b^+c^+)^*$, $B_{\sigma(cba)}$, $c^+(c^+b^+)^*$ or $c^+b^+(c^+b^+)^*$. We know from the proof of Theorem 1 that $cbc$ is duplicable in $(c^+b^+)^*$, $(b^+c^+)^*$ and $B_{\sigma(cba)}$. As the state where $q$ in $pqr$ ends lies in the finite automaton for either $(c^+b^+)^*$, $(b^+c^+)^*$ or $B_{\sigma(cba)}$, it can be duplicated again in the same finite automaton. The duplicate $q$ ends in a superstate of the state in which the original $q$ in $pqr$ ended. Hence $pq^2r \in L_R$. 

This completes the proof of $S \subseteq L_R$. 

We have proved the statement of Theorem 5 assuming all $a_i$'s in the seed $s$ to be distinct. Now assume the symbols of $s$ are not distinct. We color the symbols of $s$ so that they become distinct and obtain the system $\hat{S} = (\hat{\Sigma}, \hat{s}, \mathcal{T}^{tan}_{\le 3})$. Applying the preceding proof for distinct symbols to $\hat{S}$, we find that $\hat{S}$ is regular. Let $h : \hat{\Sigma} \to \Sigma$ be a mapping that removes the colors. Since regular languages are closed under such symbol-wise mappings, $S = h(\hat{S})$ is also regular. 

An immediate corollary on the capacity of the tandem duplication string systems considered in Theorem 5 is the following. 


Corollary 5. If for $S$ in Theorem 5, $s$ contains $abc$ as a substring such that $a$, $b$, and $c \in \Sigma$ are distinct, then $\mathrm{cap}(S) = \log_{|\Sigma|} \frac{3+\sqrt{5}}{2} \approx 0.876036 \log_{|\Sigma|} 3$. Otherwise, except for seeds of the form $a^m$, $\mathrm{cap}(S) = \log_{|\Sigma|} 2$. If $s = a^m$, $\mathrm{cap}(S) = 0$. 


Proof: By the Perron-Frobenius theory, for a regular language $L_R$, the capacity is given by the log of the maximum eigenvalue of the adjacency matrix of the strongly connected components. When $abc$ occurs as a substring of the seed $s$ such that $a$, $b$ and $c \in \Sigma$ are distinct, the adjacency matrix of the finite automaton for $B_{abc}$ (a strongly connected component of the finite automaton for $R_{a_1 a_2 \cdots a_m}$) has the maximum eigenvalue. Therefore, $\mathrm{cap}(S) = \log_{|\Sigma|} \frac{3+\sqrt{5}}{2} \approx 0.876036 \log_{|\Sigma|} 3$ (see the earlier proof for the three-symbol seed for the adjacency matrix). 

For the case when no 3 consecutive symbols in the seed $s$ are all distinct and $s \ne a^m$, the maximum capacity component is a finite automaton over only 2 distinct symbols. Hence the capacity is $\log_{|\Sigma|} 2$. 

When the seed is $s = a^m$, there is at most one sequence of any given length in the system. Hence $\mathrm{cap}(S) = 0$. ■ 

The following examples illustrate the statement of Theorem 5 and an application of its proof method. 

Example 6. The string system $S = (\{0,1,2,3\}, 0123, \mathcal{T}^{tan}_{\le 3})$ is regular by Theorem 5 and the regular expression is given by 

\[ R_{0123} = 0^+ 1^+ (0^+ 1^+)^* \, 2^+ (1^+ 2^+)^* B^*_{012} \, 3^+ (2^+ 3^+)^* B^*_{123}. \]

By Corollary 5, the capacity of this system is approximately $0.876036 \log_4 3 \approx 0.694242$. □ 
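
Example 6 can be sanity-checked on short strings by comparing the regular expression with a direct enumeration of the system. This is a sketch under stated assumptions: the helper names `B`, `R`, `generate`, and `language_up_to` are illustrative, the regular expression is transcribed into Python's `re` syntax, and the comparison only covers strings up to a small length bound, so it illustrates Theorem 5 without proving it.

```python
import re
from itertools import product


def tandem_successors(x, k):
    """Strings obtained from x by one tandem duplication of length at most k."""
    return {
        x[: i + l] + x[i : i + l] + x[i + l :]
        for l in range(1, k + 1)
        for i in range(len(x) - l + 1)
    }


def generate(seed, k, max_len):
    """All strings of length <= max_len reachable from seed."""
    seen, frontier = {seed}, [seed]
    while frontier:
        new = []
        for x in frontier:
            for y in tandem_successors(x, k):
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


def B(a, b, c):
    """B_abc = a+(c+a+)* b+(a+b+)* c+(b+c+)* as a Python regex fragment."""
    return f"{a}+(?:{c}+{a}+)*{b}+(?:{a}+{b}+)*{c}+(?:{b}+{c}+)*"


def R(seed):
    """R_{a_1...a_m} for a seed of at least 3 distinct single-character symbols."""
    a = seed
    expr = f"{a[0]}+{a[1]}+(?:{a[0]}+{a[1]}+)*"
    for i in range(2, len(a)):
        expr += f"{a[i]}+(?:{a[i - 1]}+{a[i]}+)*(?:{B(a[i - 2], a[i - 1], a[i])})*"
    return re.compile(expr)


def language_up_to(seed, max_len):
    """All strings over the seed's alphabet, up to max_len, that match R(seed)."""
    pattern, alphabet = R(seed), sorted(set(seed))
    return {
        "".join(w)
        for n in range(1, max_len + 1)
        for w in product(alphabet, repeat=n)
        if pattern.fullmatch("".join(w))
    }


# Both descriptions agree on all strings of length at most 7.
print(generate("0123", 3, 7) == language_up_to("0123", 7))
```

The length bound 7 keeps the exhaustive comparison small (at most $4^7$ candidate strings per length); larger bounds work the same way but take longer.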

Example 7. The string system $S = (\{0,1,2\}, 0112, \mathcal{T}^{tan}_{\le 3})$ is regular by Theorem 5 and the regular expression is given by 

\[ R_{0112} = 0^+ 1^+ (0^+ 1^+)^* \, 1^+ (1^+ 1^+)^* B^*_{011} \, 2^+ (1^+ 2^+)^* B^*_{112}. \]

By Corollary 5, the capacity of this system is given by $\log_3 2 \approx 0.63093$. □ 
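
The numerical values in Examples 6 and 7 follow directly from the case distinction in Corollary 5. The sketch below evaluates that case distinction; the function name `capacity_dup3` is hypothetical, and the eigenvalue $(3+\sqrt{5})/2$ is taken from the statement of the corollary rather than recomputed from an adjacency matrix.

```python
import math


def capacity_dup3(seed, alphabet_size):
    """Capacity of (Sigma, seed, T_{<=3}) following the cases of Corollary 5."""
    if len(set(seed)) == 1:
        return 0.0  # seed of the form a^m
    has_distinct_triple = any(
        len(set(seed[i : i + 3])) == 3 for i in range(len(seed) - 2)
    )
    # Largest eigenvalue as stated in Corollary 5 for the two remaining cases.
    eigenvalue = (3 + math.sqrt(5)) / 2 if has_distinct_triple else 2.0
    return math.log(eigenvalue, alphabet_size)


print(round(capacity_dup3("0123", 4), 6))  # Example 6
print(round(capacity_dup3("0112", 3), 5))  # Example 7
```

The two printed values should agree with the capacities 0.694242 and 0.63093 quoted in the examples.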

When the $a_i$'s are assumed to be distinct, it can be verified from the regular expression $R_{a_1 \cdots a_j}$ in the proof of Theorem 5 that, for all $z \in S$, the last occurrence of $a_i$ is before the first occurrence of $a_{i+3}$ for any $i = 1, 2, \ldots, j - 3$. Motivated by this, we state the following lemma regarding the structure of words in tandem duplication systems with bounded duplication lengths. 

Lemma 8. Let $s = a_1 \cdots a_m$, where the $a_i \in \Sigma$ are distinct. Then for any $z \in S = (\Sigma, s, \mathcal{T}^{tan}_{\le k})$ and any $i = 1, \ldots, m - k$, the last occurrence of $a_i$ is before the first occurrence of $a_{i+k}$ and the gap between them is at least $k - 1$ (not counting $a_i$ and $a_{i+k}$). 


Proof: Fix the value of $i$. We prove the lemma by induction. Clearly, the lemma holds for $z = s$. Assuming that it holds for $x \in S$, we show that it also holds for $y = T(x)$ for any $T \in \mathcal{T}^{tan}_{\le k}$. 

[Figure 3: state diagram of the finite automaton, edges labeled 0, 1, 2] 

Fig. 3. Finite automaton for $S = (\{0,1,2\}, 012, \mathcal{T}^{tan}_{\le 2})$. The regular expression is $R = 0^+ 1^+ (0^+ 1^+)^* 2^+ (1^+ 2^+)^*$. 


Assume $x = \alpha a_i \beta a_{i+k} \gamma$, where $\alpha, \beta, \gamma \in \Sigma^*$ and where $a_i$ and $a_{i+k}$ in this expression refer to the last occurrence of $a_i$ and the first occurrence of $a_{i+k}$ in $x$, respectively. Since, by assumption, $|\beta| \ge k - 1$, the tandem duplication $T$ cannot duplicate a substring that contains both the last occurrence of $a_i$ and the first occurrence of $a_{i+k}$. If the tandem duplication $T$ duplicates a substring of $\beta$, then the gap between the last $a_i$ and the first $a_{i+k}$ in $y$ is larger than that of $x$. In every other case, the gap stays the same. So the gap in $y$ is at least as large as the gap in $x$, which is $|\beta| \ge k - 1$. ■ 
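
Lemma 8 can be spot-checked on small instances by enumerating the system and measuring the gap for every generated string. The code below is a minimal sketch with illustrative names (`generate`, `lemma8_holds`) and an arbitrary length bound; it checks the stated property and does not replace the proof.

```python
def tandem_successors(x, k):
    """Strings obtained from x by one tandem duplication of length at most k."""
    return {
        x[: i + l] + x[i : i + l] + x[i + l :]
        for l in range(1, k + 1)
        for i in range(len(x) - l + 1)
    }


def generate(seed, k, max_len):
    """All strings of length <= max_len reachable from seed."""
    seen, frontier = {seed}, [seed]
    while frontier:
        new = []
        for x in frontier:
            for y in tandem_successors(x, k):
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


def lemma8_holds(z, seed, k):
    """Check the gap condition of Lemma 8 for a single string z."""
    for i in range(len(seed) - k):      # 0-based version of i = 1, ..., m - k
        last = z.rindex(seed[i])        # last occurrence of a_i
        first = z.index(seed[i + k])    # first occurrence of a_{i+k}
        if first - last - 1 < k - 1:    # also fails when first < last
            return False
    return True


for k in (2, 3):
    print(k, all(lemma8_holds(z, "01234", k) for z in generate("01234", k, 9)))
```

The seed 01234 and the bound 9 are arbitrary choices; any seed with distinct symbols can be substituted.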

The following example follows for maximum duplication length 2 using the same idea as in Theorem 5. 


Example 9. The string system $S = (\Sigma, a_1 a_2 \cdots a_m, \mathcal{T}^{tan}_{\le 2})$ is regular. This can be proved using the same method as used in the proof of Theorem 5. The regular expression $Q_{a_1 a_2 \cdots a_m}$ for $m \ge 2$ is given by 

\[ Q_{a_1 a_2 \cdots a_m} = a_1^+ a_2^+ (a_1^+ a_2^+)^* \, a_3^+ (a_2^+ a_3^+)^* \cdots a_m^+ (a_{m-1}^+ a_m^+)^*. \]

□ 


The finite automaton for a special case of Example 9 with $|\Sigma| = 3$ is given in Figure 3. 
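
For the special case with seed 012, the number of strings of each length can be counted directly on an automaton for $0^+1^+(0^+1^+)^*2^+(1^+2^+)^*$ and compared with a brute-force enumeration of the system. The sketch below uses a deterministic automaton written out by hand from that regular expression; its state names are illustrative and it is not claimed to be the exact drawing of Figure 3.

```python
def tandem_successors(x, k):
    """Strings obtained from x by one tandem duplication of length at most k."""
    return {
        x[: i + l] + x[i : i + l] + x[i + l :]
        for l in range(1, k + 1)
        for i in range(len(x) - l + 1)
    }


def generate(seed, k, max_len):
    """All strings of length <= max_len reachable from seed."""
    seen, frontier = {seed}, [seed]
    while frontier:
        new = []
        for x in frontier:
            for y in tandem_successors(x, k):
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


# Hand-written deterministic automaton for 0+1+(0+1+)*2+(1+2+)*.
TRANSITIONS = {
    ("start", "0"): "zeros",
    ("zeros", "0"): "zeros",
    ("zeros", "1"): "ones",
    ("ones", "1"): "ones",
    ("ones", "0"): "zeros",
    ("ones", "2"): "twos",
    ("twos", "2"): "twos",
    ("twos", "1"): "ones_after_two",
    ("ones_after_two", "1"): "ones_after_two",
    ("ones_after_two", "2"): "twos",
}
ACCEPTING = {"twos"}


def count_accepted(n):
    """Number of accepted strings of length n, by dynamic programming."""
    counts = {"start": 1}
    for _ in range(n):
        nxt = {}
        for (state, _symbol), target in TRANSITIONS.items():
            if state in counts:
                nxt[target] = nxt.get(target, 0) + counts[state]
        counts = nxt
    return sum(c for state, c in counts.items() if state in ACCEPTING)


strings = generate("012", 2, 9)
for n in range(3, 10):
    print(n, count_accepted(n), sum(1 for x in strings if len(x) == n))
```

Because the automaton is deterministic, counting accepting paths counts strings, so the two columns printed for each length should coincide.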

Corollary 10. The capacity for $S = (\Sigma, a_1 a_2 \cdots a_m, \mathcal{T}^{tan}_{\le 2})$ is given by $\log_{|\Sigma|} 2$, except for the case in which the seed is $s = a^m$ for $a \in \Sigma$. In that case, the capacity is 0. 


Proof: As in the proof of Corollary 5, by the Perron-Frobenius theory, for a regular language the capacity is given by the log of the maximum eigenvalue of the adjacency matrix of the strongly connected components. Except for the case when the seed is $s = a^m$, $ab$ ($a, b \in \Sigma$) occurs as a substring of the seed $s$ such that $a \ne b$. Hence, the maximum capacity component in the finite automaton for $Q_{a_1 a_2 \cdots a_m}$ is of the form $(a^+ b^+)^*$, for which the capacity is $\log_{|\Sigma|} 2$. ■ 
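
The eigenvalue computation behind Corollary 10 is small enough to reproduce by hand or with a few lines of code. The sketch below assumes a two-state strongly connected component for $(a^+b^+)^*$, with one state for "currently reading $a$'s" and one for "currently reading $b$'s"; this matrix is an illustrative construction and is not copied from the paper's figures. The largest eigenvalue is estimated by power iteration.

```python
import math


def spectral_radius(matrix, iterations=200):
    """Estimate the largest eigenvalue of a non-negative matrix by power iteration."""
    v = [1.0] * len(matrix)
    lam = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        lam = max(w)
        v = [x / lam for x in w]  # renormalize so the largest entry is 1
    return lam


# From either state, reading 'a' leads to the a-state and 'b' to the b-state.
ADJACENCY = [
    [1, 1],
    [1, 1],
]

lam = spectral_radius(ADJACENCY)
print(lam)                # largest eigenvalue of the component
print(math.log(lam, 3))   # capacity for an alphabet of size 3
```

For an alphabet of size 3 this gives $\log_3 2 \approx 0.63093$, the value that also appears in Example 7.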


Our capacity results are listed in Table IV. 


V. Conclusion 

In this paper, we showed that for tandem duplication string systems with bounded duplication length, if the maximum duplication length is 3 or less, the language described by the string system is regular. Further, we computed exact capacities for these systems. As future work, we would like to calculate capacities for bounded tandem duplication string systems with maximum duplication length greater than 3. 

Using Thue's result, we showed that a tandem duplication string system cannot be fully expressive if the alphabet size is at least 4. However, for an alphabet of size 3 or less such systems can be fully expressive. In this way, we completely characterized fully expressive and non-fully expressive tandem duplication string systems with bounded duplication length. As future work, we would like to generalize the notion of expressiveness by counting the asymptotic number of substrings of length $n$ that a string system can generate. Mathematically, we define the expressiveness $\mathrm{Exp}(S)$ of a string system $S$ as 

\[ \mathrm{Exp}(S) = \limsup_{n \to \infty} \frac{\log_{|\Sigma|} E_n(S)}{n}. \]

Here $E_n(S)$ represents the number of substrings of length $n$ that can be generated by $S$. Note that with this definition of expressiveness, a fully expressive string system $S$ has $\mathrm{Exp}(S) = 1$. 
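
For small parameters, the quantity $E_n(S)$ can be bounded from below by enumeration. The following sketch is an assumption-laden illustration: `substring_count` only inspects strings up to a chosen length, so it yields a lower bound on $E_n(S)$, and `expressiveness_estimate` evaluates $\log_{|\Sigma|} E_n(S) / n$ at a single finite $n$ rather than the limit superior.

```python
import math


def tandem_successors(x, k):
    """Strings obtained from x by one tandem duplication of length at most k."""
    return {
        x[: i + l] + x[i : i + l] + x[i + l :]
        for l in range(1, k + 1)
        for i in range(len(x) - l + 1)
    }


def generate(seed, k, max_len):
    """All strings of length <= max_len reachable from seed."""
    seen, frontier = {seed}, [seed]
    while frontier:
        new = []
        for x in frontier:
            for y in tandem_successors(x, k):
                if len(y) <= max_len and y not in seen:
                    seen.add(y)
                    new.append(y)
        frontier = new
    return seen


def substring_count(seed, k, n, max_len):
    """Lower bound on E_n(S): distinct length-n substrings found up to max_len."""
    strings = generate(seed, k, max_len)
    return len({x[i : i + n] for x in strings for i in range(len(x) - n + 1)})


def expressiveness_estimate(seed, k, n, max_len, alphabet_size):
    """Finite-n value of log_{|Sigma|}(E_n) / n based on the lower bound."""
    return math.log(substring_count(seed, k, n, max_len), alphabet_size) / n


print(substring_count("01", 2, 3, 6))    # binary system, duplication length <= 2
print(substring_count("012", 3, 3, 9))   # ternary system, duplication length <= 3
```

For the binary system all 8 strings of length 3 are found, consistent with full expressiveness, while for the ternary system with duplication length at most 3 the count is 24, reflecting the three excluded strings 021, 210, and 102.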

In this paper, we looked at questions related to the generation of a diversity of sequences from a seed given a tandem duplication rule. One can also study the minimum number of steps required to deduplicate a given sequence of length $n$ to a square-free seed and thereby define a notion of distance between a sequence and its seed given a tandem duplication rule. Note that the same sequence can be deduplicated to more than one square-free seed given a tandem duplication rule. For example, the sequence 012101212 can be deduplicated to 012 as well as to 0121012 under bounded tandem duplication with maximum duplication length 4 in the following way: 

\[ \underline{01210121}2 \xrightarrow{dd_{\le 4}} 0\underline{1212} \xrightarrow{dd_{\le 4}} 012, \]

\[ 01210\underline{1212} \xrightarrow{dd_{\le 4}} 0121012. \]

Here the underlined portion represents the repeat that is being deduplicated in a given step. 
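
The deduplication example can be reproduced with a short breadth-first search over all possible deduplication steps. This is only a sketch of the distance notion suggested above: the function names are illustrative, and "root" here means a string with no tandem repeat of length at most $k$ left to remove.

```python
from collections import deque


def deduplications(x, k):
    """Strings obtained from x by removing one copy of a repeat of length <= k."""
    out = set()
    for l in range(1, k + 1):
        for i in range(len(x) - 2 * l + 1):
            if x[i : i + l] == x[i + l : i + 2 * l]:
                out.add(x[: i + l] + x[i + 2 * l :])
    return out


def roots_with_distance(x, k):
    """Map each reachable irreducible string to its minimum number of steps."""
    dist = {x: 0}
    queue = deque([x])
    roots = {}
    while queue:
        y = queue.popleft()
        nxt = deduplications(y, k)
        if not nxt:
            roots[y] = dist[y]  # nothing left to deduplicate
        for z in nxt:
            if z not in dist:
                dist[z] = dist[y] + 1
                queue.append(z)
    return roots


print(roots_with_distance("012101212", 4))
```

For the sequence 012101212 and maximum duplication length 4, the search finds the two seeds from the example: 012 at distance 2 and 0121012 at distance 1.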